A Computational Theory of the Use-Mention Distinction in Natural Language
To understand the language we use, we sometimes must turn language on itself, and we do this through an understanding of the use-mention distinction. In particular, we are able to recognize mentioned language: that is, tokens (e.g., words, phrases, sentences, letters, symbols, sounds) produced to draw attention to linguistic properties that they possess. Evidence suggests that humans frequently employ the use-mention distinction, and we would be severely handicapped without it; mentioned language frequently occurs for the introduction of new words, attribution of statements, explanation of meaning, and assignment of names. Moreover, just as we benefit from mutual recognition of the use-mention distinction, the potential exists for us to benefit from language technologies that recognize it as well. With a better understanding of the use-mention distinction, applications can be built to extract valuable information from mentioned language, leading to better language learning materials, precise dictionary building tools, and highly adaptive computer dialogue systems.
This dissertation presents the first computational study of how the use-mention distinction occurs in natural language, with a focus on occurrences of mentioned language. Three specific contributions are made. The first is a framework for identifying and analyzing instances of mentioned language, in an effort to reconcile elements of previous theoretical work for practical use. Definitions for mentioned language, metalanguage, and quotation have been formulated, and a procedural rubric has been constructed for labeling instances of mentioned language. The second is a sequence of three labeled corpora of mentioned language, containing delineated instances of the phenomenon. The corpora illustrate the variety of mentioned language, and they enable analysis of how the phenomenon relates to sentence structure. Using these corpora, inter-annotator agreement studies have quantified the concurrence of human readers in labeling the phenomenon. The third contribution is a method for identifying common forms of mentioned language in text, using patterns in metalanguage and sentence structure. Although the full breadth of the phenomenon is likely to elude computational tools for the foreseeable future, some specific, common rules for detecting and delineating mentioned language have been shown to perform well.
Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models
We analyze sentiment analysis and toxicity detection models to detect the
presence of explicit bias against people with disability (PWD). We employ the
bias identification framework of Perturbation Sensitivity Analysis to examine
conversations related to PWD on social media platforms, specifically Twitter
and Reddit, in order to gain insight into how disability bias is disseminated
in real-world social settings. We then create the Bias Identification Test in
Sentiment (BITS) corpus to quantify explicit disability bias in any sentiment
analysis or toxicity detection model. Our study uses BITS to uncover
significant biases in four open AIaaS (AI as a Service) sentiment analysis
tools (TextBlob, VADER, Google Cloud Natural Language API, and DistilBERT) and
two toxicity detection models (two versions of Toxic-BERT). Our findings
indicate that all of these models exhibit statistically significant explicit
bias against PWD.
Comment: TrustNLP at ACL 202
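The perturbation-sensitivity setup described above can be sketched as follows. This is a minimal illustration only: the toy lexicon scorer stands in for the actual tools (TextBlob, VADER, etc.), and the template sentence, identity terms, and lexicon weights are invented examples, not items from the BITS corpus.

```python
# Sketch of a perturbation-sensitivity check for explicit bias: score the
# same template sentence with different identity terms filled in, then
# compare the resulting sentiment scores.

# Toy lexicon-based scorer standing in for a real tool such as VADER.
LEXICON = {"great": 1.0, "terrible": -1.0, "blind": -0.5}  # hypothetical weights

def toy_sentiment(text: str) -> float:
    words = text.lower().replace(".", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

TEMPLATE = "My neighbor, a {} person, told a great story."
IDENTITY_TERMS = ["sighted", "blind", "deaf"]  # perturbation set

scores = {term: toy_sentiment(TEMPLATE.format(term)) for term in IDENTITY_TERMS}
# An unbiased scorer assigns the same sentiment to semantically equivalent
# sentences; the spread across perturbations quantifies explicit bias.
bias_spread = max(scores.values()) - min(scores.values())
print(scores, bias_spread)
```

Here the identity term alone shifts the score, which is the signature of explicit bias that a perturbation-sensitivity analysis is designed to surface.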
This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities
Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as "Figure 1" for inexplicit phrases like "in this figure" or "from these premises". We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.
Effects of Online Self-Disclosure on Social Feedback During the COVID-19 Pandemic
We investigate relationships between online self-disclosure and received
social feedback during the COVID-19 crisis. We crawl a total of 2,399 posts and
29,851 associated comments from the r/COVID19_support subreddit and manually
extract fine-grained personal information categories and types of social
support sought from each post. We develop a BERT-based ensemble classifier to
automatically identify types of support offered in users' comments. We then
analyze the effect of personal information sharing and posts' topical, lexical,
and sentiment markers on the acquisition of support and five interaction
measures (submission scores, the number of comments, the number of unique
commenters, the length and sentiments of comments). Our findings show that: 1)
users were more likely to share their age, education, and location information
when seeking both informational and emotional support, as opposed to pursuing
either one; 2) while personal information sharing was positively correlated
with receiving informational support when requested, it did not correlate with
emotional support; 3) as the degree of self-disclosure increased, information
support seekers obtained higher submission scores and longer comments, whereas
emotional support seekers' self-disclosure resulted in lower submission scores,
fewer comments, and fewer unique commenters; 4) post characteristics affecting
social feedback differed significantly based on types of support sought by post
authors. These results provide empirical evidence for the varying effects of
self-disclosure on acquiring desired support and user involvement online during
the COVID-19 pandemic. Furthermore, this work can assist support seekers hoping
to enhance and prioritize specific types of social feedback.
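The ensemble step described above can be sketched as majority voting over several classifiers' labels. The code below is a minimal sketch: the individual model outputs are invented placeholders standing in for the predictions of fine-tuned BERT models, and the label set is reduced to the two support types named in the abstract.

```python
from collections import Counter

# Sketch of majority-vote ensembling of per-model predictions for a single
# comment. The model outputs here are hypothetical stand-ins for the
# predictions of several fine-tuned BERT classifiers.
def ensemble_vote(predictions: list[str]) -> str:
    # The most common label wins; Counter.most_common breaks ties by
    # first-seen order among the predictions.
    return Counter(predictions).most_common(1)[0][0]

model_outputs = ["informational", "emotional", "informational"]
label = ensemble_vote(model_outputs)
print(label)  # "informational"
```

Majority voting is one common way to combine classifiers in an ensemble; the paper's actual aggregation scheme may differ.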
Survey on Sociodemographic Bias in Natural Language Processing
Deep neural networks often learn unintended biases during training, which
might have harmful effects when deployed in real-world settings. This paper
surveys 209 papers on bias in NLP models, most of which address
sociodemographic bias. To better understand the distinction between bias and
real-world harm, we turn to ideas from psychology and behavioral economics to
propose a definition for sociodemographic bias. We identify three main
categories of NLP bias research: types of bias, quantifying bias, and
debiasing. We conclude that current approaches to quantifying bias face
reliability issues, that many bias metrics do not relate to real-world biases,
and that current debiasing techniques are superficial, hiding bias rather than
removing it. Finally, we provide recommendations for future work.
Comment: 23 pages, 1 figure
Unmasking Nationality Bias: A Study of Human Perception of Nationalities in AI-Generated Articles
We investigate the potential for nationality biases in natural language
processing (NLP) models using human evaluation methods. Biased NLP models can
perpetuate stereotypes and lead to algorithmic discrimination, posing a
significant challenge to the fairness and justice of AI systems. Our study
employs a two-step mixed-methods approach that includes both quantitative and
qualitative analysis to identify and understand the impact of nationality bias
in a text generation model. Through our human-centered quantitative analysis,
we measure the extent of nationality bias in articles generated by AI sources.
We then conduct open-ended interviews with participants, performing qualitative
coding and thematic analysis to understand the implications of these biases on
human readers. Our findings reveal that biased NLP models tend to replicate and
amplify existing societal biases, which can translate to harm if used in a
sociotechnical setting. The qualitative analysis from our interviews offers
insights into the experience readers have when encountering such articles,
highlighting the potential to shift a reader's perception of a country. These
findings emphasize the critical role of public perception in shaping AI's
impact on society and the need to correct biases in AI systems.
Nationality Bias in Text Generation
Little attention has been paid to analyzing nationality bias in language
models, even though nationality is widely used as a feature to improve the
performance of social NLP models. This paper examines how a text generation
model, GPT-2, accentuates pre-existing societal biases about country-based
demonyms. We generate stories using GPT-2 for various nationalities and use
sensitivity analysis to explore how the number of internet users and the
country's economic status impacts the sentiment of the stories. To reduce the
propagation of biases through large language models (LLM), we explore the
debiasing method of adversarial triggering. Our results show that GPT-2
demonstrates significant bias against countries with fewer internet users, and
that adversarial triggering effectively reduces this bias.
Comment: Paper accepted in the 17th Conference of the European Chapter of the
Association for Computational Linguistics (EACL 2023)
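The generate-and-score loop described above can be sketched as follows. Everything in this sketch is a placeholder: `generate_story` stands in for a GPT-2 generation call, `score_sentiment` for a sentiment model, and the prompt template and demonym list are illustrative assumptions, not the paper's actual materials.

```python
# Sketch of the experimental loop: generate a story for each demonym-filled
# prompt and record the sentiment of the generation. Both helper functions
# are placeholders, not the paper's code.

def generate_story(prompt: str) -> str:
    # Placeholder: a real implementation would call GPT-2, e.g. via the
    # Hugging Face transformers library's text-generation pipeline.
    return prompt + " ..."

def score_sentiment(text: str) -> float:
    # Placeholder: a real implementation would return a score in [-1, 1]
    # from a sentiment analysis model.
    return 0.0

DEMONYMS = ["French", "Kenyan", "Peruvian"]  # illustrative subset
PROMPT_TEMPLATE = "The {} people are"  # hypothetical prompt shape

results = {}
for demonym in DEMONYMS:
    story = generate_story(PROMPT_TEMPLATE.format(demonym))
    results[demonym] = score_sentiment(story)
# Downstream, these per-nationality scores would be analyzed against
# covariates such as internet-user counts and economic status.
print(results)
```

The sensitivity analysis in the abstract then asks how these per-nationality sentiment scores vary with country-level covariates.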
Understanding How to Inform Blind and Low-Vision Users about Data Privacy through Privacy Question Answering Assistants
Understanding and managing data privacy in the digital world can be
challenging for sighted users, let alone blind and low-vision (BLV) users.
There is limited research on how BLV users, who have special accessibility
needs, navigate data privacy, and how potential privacy tools could assist
them. We conducted an in-depth qualitative study with 21 US BLV participants to
understand their data privacy risk perception and mitigation, as well as their
information behaviors related to data privacy. We also explored BLV users'
attitudes towards potential privacy question answering (Q&A) assistants that
enable them to better navigate data privacy information. We found that BLV
users face heightened security and privacy risks, but their risk mitigation is
often insufficient. They do not necessarily seek data privacy information but
clearly recognize the benefits of a potential privacy Q&A assistant. They also
expect privacy Q&A assistants to possess cross-platform compatibility, support
multi-modality, and demonstrate robust functionality. Our study sheds light on
BLV users' expectations when it comes to usability, accessibility, trust, and
equity issues regarding digital data privacy.
Comment: This research paper is accepted by USENIX Security '2